Complete chloroplast genomes of Cerastium alpinum, C. arcticum and C. nigrescens: genome structures, comparative and phylogenetic analysis

The genus Cerastium includes about 200 species that are mostly found in the temperate climates of the Northern Hemisphere. Here we report the complete chloroplast genomes of Cerastium alpinum, C. arcticum and C. nigrescens. The length of cp genomes ranged from 147,940 to 148,722 bp. Their quadripartite circular structure had the same gene organization and content, containing 79 protein-coding genes, 30 tRNA genes, and four rRNA genes. Repeat sequences varied from 16 to 23 per species, with palindromic repeats being the most frequent. The number of identified SSRs ranged from 20 to 23 per species and they were mainly composed of mononucleotide repeats containing A/T units. Based on Ka/Ks ratio values, most genes were subjected to purifying selection. The newly sequenced chloroplast genomes were characterized by a high frequency of RNA editing, including both C to U and U to C conversion. The phylogenetic relationships within the genus Cerastium and family Caryophyllaceae were reconstructed based on the sequences of 71 protein-coding genes. The topology of the phylogenetic tree was consistent with the systematic position of the studied species. All representatives of the genus Cerastium were gathered in a single clade with C. glomeratum sharing the least similarity with the others.


Synonymous (Ks) and non-synonymous (Ka) substitution rate analysis
The substitution rate varied across genes in each functional group and ranged from 0 to 0.151 for Ka and from 0 to 0.0858 for Ks (Supplementary Table S4). The highest average value of Ka (0.0062) was noted in the group of "other genes" and the lowest in genes related to the cytochrome b/f complex (0.0012) and photosystem II (0.0014). The highest average value of Ks (0.0247) was noted in the gene for the RubisCO large subunit, and the lowest in genes associated with the small subunit of the ribosome (0.0159) and subunits of ATP synthase (0.0160). In summary, no differences (Ka = 0 and Ks = 0) were observed in the sequences of 11 genes, whereas only synonymous substitutions (Ka = 0) were observed in 18 genes. The Ka/Ks ratio was less than 1 in all genes except ndhB (2.7250 for C. arvense). Relatively high values of Ka/Ks were observed in rpl22 (0.8673) for all studied species and in rps14 (0.8776) for C. arvense. In the remaining cases, the values did not exceed 0.75 (Fig. 5).

Genomic comparative and nucleotide diversity analyses
The MAUVE results revealed a highly conserved structure of the chloroplast genomes of C. alpinum, C. arcticum, C. nigrescens, and C. arvense, in which no rearrangements (inversions or translocations) were detected. Only in C. glomeratum was the opposite orientation of the whole SSC region observed (Supplementary Fig. S2).

Prediction of RNA editing sites
Prediction of RNA editing sites with the PREPACT 3.0 tool revealed from 578 to 588 editing sites in 63 protein-coding genes (Fig. 5, Supplementary Table S5A-D). The lowest number of predicted RNA editing sites (578) was found for C. alpinum and C. nigrescens, whereas the highest (588) was found for C. glomeratum. For C. arcticum, the number of RNA editing sites was 583. Both C to U and U to C conversions were found among the identified editing events. In 14 genes no such changes were identified. The C to U conversion accounted for 43.05% to 43.54% of total RNA editing sites, while U to C substitutions were responsible for 56.46% to 56.95% of the identified editing events. All predicted RNA editing sites resulted in non-synonymous mutations. About 47% (47.17-47.28%) of the substitutions were found at the first position of the codon and about 53% (52.72-52.83%) at the second position; none were found at the third position. The predicted RNA editing events also included conversions that involved two RNA editing sites within one codon. Eighteen such editing events were identified in C. alpinum and C. nigrescens, and 20 in C. arcticum and C. glomeratum. Most of these events involved conversions of UCU and UCC codons for serine (S) into CUU and CUC triplets for leucine (L) and back from leucine to serine, and also conversions of UUU and UUC for phenylalanine (F) to CCU and CCC for proline (P) and in the opposite direction, i.e., from proline to phenylalanine. The highest numbers of predicted RNA editing sites were reported for the ycf1 (85-88), ycf2 (77), and rpoC2 (64-65) genes. The most frequent substitution in each species was the phenylalanine (F) to leucine (L) change (16.48-16.75%), whereas proline (P) to phenylalanine (F) and arginine (R) to tryptophan (W) changes were observed with the lowest frequencies (0.353-0.358% and 0.881-0.896%, respectively). Additionally, the conversion of the termination codon UAA to the CAA triplet encoding glutamine was predicted to result from RNA editing in the ndhI gene of C. arcticum.
We conducted the same analysis for the chloroplast genes of C. arvense. Because of the incomplete sequences available for rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2, these genes were not included in the analysis. In 17 of the 71 analyzed genes, we did not identify potential RNA editing sites. In the remaining 54 genes we found 286 editing sites (Fig. 5, Supplementary Table S5E). For this species, too, both C to U and U to C conversions were found, with U to C editing dominating (56.46%). Substitutions occurred at the first (53.06%) and the second (46.94%) position of the codon, whereas they were absent from the third position. As described above for C. alpinum, C. arcticum, C. nigrescens, and C. glomeratum, the predicted RNA editing events in C. arvense also included conversions that involved two RNA editing sites within one codon. There were seven cases in which CUU and CUA codons for leucine (L) were changed into UCU and UCA for serine (S), and back from serine to leucine. The highest numbers of predicted RNA editing sites were identified within the sequences of the matK (40) and ndhF (37) genes. All the identified RNA editing events caused non-synonymous mutations. The change from phenylalanine (F) to leucine (L) was the most abundant substitution (18.53%), whereas leucine (L) to proline (P) and arginine (R) to cysteine (C) changes were observed with the lowest frequency (0.7%).

Phylogenetic analysis
In the BI tree, a very high Bayesian posterior probability value (≥ 0.

Discussion
Chloroplast genomes are a relevant resource for many genomic and biotechnological applications 31. Their unique features, such as the lack of recombination and a slower mutation rate than in nuclear genomes, make the chloroplast genome a frequently used source of data in evolutionary biology 32. Moreover, the chloroplast genome is commonly used in phylogeographic studies because its uniparental inheritance exhibits geographical structure 33.
Although the genus Cerastium consists of more than 200 species 1, genomic data for this group of plants are very limited and, to date, the NCBI database contains only one complete chloroplast genome sequence, for C. glomeratum. A chloroplast genome sequence is also available for another Cerastium species (C. arvense), but several gaps in the intergenic spacers and the lack of complete sequences for six protein-coding genes (rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2) constrain its use. To fill the gap in knowledge of the genomics of the genus Cerastium, we sequenced and annotated the plastid genomes of three species: C. alpinum, C. arcticum, and C. nigrescens. The size of the reported cp genomes ranged from 147,940 bp (C. nigrescens) to 148,722 bp (C. arcticum) and was similar to the plastome of C. glomeratum (148,643 bp) and other angiosperms 34. All three studied cp genomes share the same gene content and order and the typical quadripartite structure, with a pair of inverted repeats (IR) separated by an SSC and an LSC region. Length variation in cp genomes in different groups of plants is often caused by expansions and contractions of IR regions 35. In extreme cases, the IR regions have been completely lost from the chloroplast genomes of some algae 36, or one IR copy is absent in some representatives of leguminous plants 37. Consequently, analysis of the distribution of IR/LSC and IR/SSC borders has become a standard element of plastome characterization. Such analyses have revealed that border locations may differ among species, even between closely related genera 38. Analysis of the chloroplast genomes of C. alpinum, C. arcticum, and C. nigrescens reported here revealed that the IR/LSC and IR/SSC boundaries were located within the sequences of the ycf1 and rps19 genes (Fig. 2), which is analogous to the situation observed for most angiosperms 39. The location of the IR boundaries was identical for C. alpinum and C. nigrescens, whereas a minor shift (two bases within rps19 and three bases within the ycf1 gene) was observed for C. arcticum. The lengths of the IR and SSC regions in the reported plastomes were very similar and ranged from 25,507 to 25,513 bp and from 16,850 to 16,861 bp, respectively. Higher variation was found for the LSC region, where the difference between the longest and the shortest LSC was 782 bp (C. arcticum vs. C. nigrescens). Nevertheless, the sizes of all three plastome regions are consistent with previous reports for other dicotyledons 40,41. For comparative purposes, the IR borders within the chloroplast genome of C. glomeratum were also examined. In this case, more differences were observed. Although the IR borders were also located within the rps19 and ycf1 genes, an 11-base shift for rps19 and a 45-base shift for ycf1 were found. Additionally, only one copy of ycf1 can be found within the C. glomeratum plastome, at the IRB/SSC border, as its incomplete copy (ψycf1) at the IRA/SSC border was not annotated. However, the main difference is the opposite orientation of the whole SSC region. This phenomenon was originally reported for Phaseolus vulgaris 42. Using restriction enzyme analysis, the author showed that the chloroplast DNA of individual plants exhibits a type of heteroplasmy in which the plastome occurs in two equimolar states (i.e., inversion isomers) that differ in the orientation of the SSC region. Later this phenomenon was confirmed in other species, e.g., Heterosigma akashiwo 43, Lasthenia burkei 44, and Artemisia frigida 45.
The chloroplast genomes of C. alpinum, C. arcticum, and C. nigrescens contained an identical set of 113 genes, which also appeared to be identical to that of C. arvense. In the cp genome of C. glomeratum, the lack of the psbL gene was noticed during the analyses, but reannotation of the plastome allowed us to identify the psbL sequence between the psbJ and psbF genes. Furthermore, two additional genes, infA (encoding translation initiation factor I) and rpl23 (encoding ribosomal protein L23), were not annotated in the C. glomeratum plastome. Detailed analysis of the chloroplast genome of this species enabled identification of these sequences, but their pseudogenization (i.e., the presence of internal, premature termination codons) was the most probable reason why they were not annotated by the original authors of the sequence. In C. alpinum, C. arcticum, and C. nigrescens, rpl23 was also identified as a pseudogene, whereas a complete sequence of the infA gene was found and annotated. Loss of the infA gene was also observed in other species within the Caryophyllales 46. In some cases, the infA gene was found to be a pseudogene, among others in Nicotiana tabacum 47, Arabidopsis thaliana 48, Oenothera elata 41 and several Allium species 49. In the chloroplast genome of another Caryophyllaceae representative, Dianthus superbus var. longicalycinus, both infA and rpl23 were retained as pseudogenes 50. Pseudogenization of the rpl23 gene was also previously reported in various species, among others within the genera Triticum 51, Hordeum 52 and Secale 53.
The studied Cerastium cp genomes had a GC content of 36.46-36.52%, which is comparable with other Caryophyllaceae: 36.32% in Dianthus caryophyllus 54, 36.4% in Silene jenisseensis 55, 36.5% in Pseudostellaria palibiniana 56, P. okamotoi 57, P. heterophylla 58, P. longipedicellata 59 and Gymnocarpos przewalskii 60, and 36.7% in Colobanthus quitensis 61.
The repeat regions of the genomes are of particular importance in sequence rearrangement and recombination 62. The genomic repeats identified within the chloroplast genomes of C. alpinum, C. arcticum, C. nigrescens, and C. glomeratum ranged from 30 to 170 bp in length and were located predominantly (56.3-69.6%) within non-coding regions. Similar values were reported in other Caryophyllaceae, such as Colobanthus quitensis (53.3%) 63, Silene capitata (56.0%) and Lychnis wilfordii (69.2%) 64. The majority of the repeats (78-90%) in all four Cerastium genomes were between 30 and 40 bp in length. Similar values were reported in other angiosperms: legumes (Glycine, Lotus, Medicago 65) and cotton (Gossypium hirsutum 66).
Chloroplast simple sequence repeats, or microsatellites, are repetitive genomic elements that typically consist of tandemly repeated copies of mono- to hexanucleotide motifs, which are usually found in non-coding regions 67. Due to their high abundance, random distribution within the genome and high polymorphism information content, they are also widely used for high-throughput genotyping 68. These markers have proved their usefulness in population genetics and evolutionary studies 69,70. In the analyzed plastomes of four Cerastium species, mononucleotide (A/T) repeats were the most abundant SSR motif (36.4-43.5%). The dominance of mononucleotide chloroplast SSRs has also been observed in other Caryophyllaceae, where it ranged from 44.8% in Colobanthus apetalus 63 or 55.3% in C. lycopodioides 71 up to 76.8% in Silene capitata or 77.6% in Lychnis wilfordii 64. In turn, di- (AT/TA), penta- (AATAT/TATAA) and hexanucleotide (AAATCC/CCTAAA) microsatellites were the least abundant, and only one such element was identified in C. glomeratum, C. alpinum, and C. arcticum, respectively.
The synonymous (Ks) and non-synonymous (Ka) substitution rates and their ratio (Ka/Ks) are important parameters in gene evolution studies 72. Generally, in most coding regions synonymous nucleotide substitutions dominate over non-synonymous changes 73. This was also observed in our study, where Ks values dominated over Ka, indicating high sequence conservation. Nevertheless, considerable variation due to high Ka values was found for some sequences. The highest Ka values were observed for rpl32 (average Ka = 0.0151) and matK (average Ka = 0.0134). High variation of the matK sequence has been widely documented, and matK is recognized as one of the most promising barcoding regions for systematic and evolutionary studies in plants 74,75. There are also studies reporting high genetic diversity in the immediate vicinity of the rpl32 gene (ndhF-rpl32 or rpl32-trnL) 76,77 and a role of the rpl32 gene in the evolution of chloroplast genomes, involving its complete loss, substitution or transfer to the nucleus (for a review see 78). Assessing the ratio of non-synonymous (Ka) to synonymous (Ks) substitutions is a widely accepted approach for inferring the direction of sequence evolution at the protein level (Ka/Ks > 1 indicates positive selection, Ka/Ks < 1 indicates negative or purifying selection, and Ka/Ks = 1 indicates neutral evolution) 79,80. Protein functions are maintained through purifying selection, whereas positive selection favors new gene variants that may be beneficial for organisms adapting to changing environmental conditions. In our study, the Ka/Ks ratio of all genes was less than 1, except for ndhB (2.7250 for C. arvense), implying that this gene evolved at a faster rate and underwent positive selection. The same pattern of selection (Ka/Ks > 1.0) for the ndhB gene was also reported in species representing the families Gentianaceae (Gentiana lawrencei 81), Orchidaceae (Calanthe delavayi 82) and Cupressaceae (Cupressus and Juniperus species 83). The ndh genes, encoding subunits of NADH dehydrogenase, play a key role in the use of light energy and the electron transfer chain to produce ATP, an essential component for photosynthesis 84. Chloroplast NADH dehydrogenase is sensitive to strong light stress and can protect plants from photoinhibition or photooxidation stress by stabilizing the NADH complex and preventing drought-related declines in photosynthetic rate and growth delay 85. These observations may suggest that NADH dehydrogenase genes are involved in adaptation to environmental stresses through optimization of photosynthesis. An excess of functionally adaptive amino acid substitutions within NADH dehydrogenase genes was described previously for Poaceae 86. The authors observed signals of positive selection acting on one-third of all chloroplast protein-coding genes (25 out of 76), including nine of the eleven genes encoding subunits of NADH dehydrogenase. In our study, the signal of positive selection detected for the ndhB gene in C. arvense might be interpreted as one of the mechanisms of physical adaptation that enabled this cosmopolitan species to colonize vast areas of Europe and North America.
Highly variable sequences found within chloroplast genomes are a common source of molecular markers suitable for phylogenetic analyses and species identification 87. Although traditional chloroplast barcoding regions, such as matK, rbcL or the intergenic spacer trnH-psbA, revealed lower than expected genetic diversity, our genome-wide comparative analysis of the plastomes of four Cerastium species (C. alpinum, C. arcticum, C. nigrescens, and C. glomeratum) allowed us to identify nine fast-evolving regions. Among these divergent hotspots (π > 0.015), seven regions (trnD-GUC-trnY-GUA, trnF-GAA-ndhJ, ndhC-trnV-UAC, petA-psbJ, psbE-petL, trnP-UGG-psaJ and the intron within the rps16 sequence) were located within the LSC region and two (rpl32-trnL-UAG and the intron within the ndhA sequence) within the SSC region. To the best of our knowledge, none of these chloroplast genome regions has been used to date for phylogeny reconstruction within the genus Cerastium. Nevertheless, several phylogenetic studies of various groups of plant species, including the family Caryophyllaceae, have used at least some of the regions listed above, e.g., the rps16 intron 88, petA-psbJ 89 or rpl32-trnL-UAG 90.
RNA editing is one of the most important post-transcriptional modifications, which mainly occurs in mitochondrial and chloroplast transcripts 91,92. RNA editing is described as a process involved in the correction of missense mutations of genes at the RNA level. This mechanism can alter the nucleotide sequence through insertion, deletion, or substitution of nucleotides 93,94 to preserve the function of encoded proteins 95. The first report of RNA editing was documented for the cox2 gene in the protozoan parasite Trypanosoma brucei 96, whereas in plants RNA editing was first discovered in the sequence of cox2 of Triticum aestivum 97 and then in rpl2 in maize 98. Editing sites have since been reported in many other species, among others A. thaliana 93, N. tabacum 99, Oryza sativa 100, Pisum sativum 101 and Manihot esculenta 102. RNA editing that converts cytidine into uridine (C to U) is widespread in plant organelles and occurs mostly at the first or second positions of codons 103, whereas the reverse U to C conversion is more restricted in occurrence. In the studied Cerastium species, both C to U and U to C editing were found. U to C RNA editing is rather rare in terrestrial plants, but it has been found in some taxa, among others A. thaliana 104, hornworts 105, lycophytes 106 and ferns 107.
One of the plant groups that has been intensively studied in terms of its phylogeny is the family Caryophyllaceae. Traditionally, Caryophyllaceae was divided into three subfamilies: Alsinoideae, Caryophylloideae, and Paronychioideae 108. However, the traditional taxonomy of the family encountered many difficulties; for example, most of the genera appeared to be polyphyletic, probably because many of the morphological characters evolved in parallel 109. More recently, a new classification of the family Caryophyllaceae based on three chloroplast regions (matK, trnL-trnF, and rps16) was proposed, which divided this group into 11 tribes 110. Unfortunately, only two Cerastium species (C. arvense and C. fontanum) were represented in that study; based on their molecular characteristics they were nested within the tribe Alsineae, together with representatives of the following genera: Stellaria, Pseudostellaria, Myosoton, Plettkea, Holosteum, Moenchia, and Lepyrodiclis. Cerastium is one of the Caryophyllaceae genera whose structure is still intensively debated. Even the number of species distinguished within this group of plants is problematic, with estimates varying from 60 111 or 100 3,112 up to 200 113 species. Phylogenetic analyses employing multiple nuclear and plastid DNA sequences have established the monophyly of Cerastium 13,114. However, some issues in Cerastium systematics still need clarification, for example the status of the C. alpinum-C. arcticum complex, which includes C. alpinum, C. arcticum, and C. nigrescens. Several evolutionary lineages were identified within that complex in earlier research based on morphology, isozymes, and DNA markers 6,10,19. It was reported that the origin and evolution of these taxa are most likely related to fluctuations in ice sheet range during the Quaternary glaciations, which caused extensive migrations of the species and enabled multiple hybridization and introgression events 11,19,115. This hypothesis is consistent with the results of studies reporting no variation in chloroplast trnL-trnF and psbA-trnH sequences among representatives of the arctic-alpine C. alpinum-C. arcticum complex and members of the boreal-temperate C. tomentosum and C. arvense groups 13.
In our study, phylogenetic analysis was based on 71 concatenated protein-coding gene sequences. The phylogenetic relationships revealed among the analyzed representatives of the family Caryophyllaceae were concordant with the taxonomic position of the studied species and with previous phylogenies of this group 109,116. Moreover, the results allowed us to discriminate unambiguously among all analyzed species, including five representatives of the genus Cerastium (C. alpinum, C. arcticum, C. nigrescens, C. glomeratum, and C. arvense). This agrees with the previous observation that a phylogenetic network combining several genes is preferable to a single-gene tree, as the latter is typically insufficient to reveal reliable phylogenetic relationships 117. All Cerastium species were gathered in one clade, but C. glomeratum appeared to be the most divergent from the other species.
Our divergence time analysis confirmed the results of previous studies on the molecular and temporal diversification of the family Caryophyllaceae. In line with the results of the latest research based on the nuclear ITS region and four plastid sequences (matK, rbcL, rps16 and trnL-F) 118, our studies suggested that the family Caryophyllaceae began to diversify before the end of the Cretaceous (ca. 74.46 Mya) and that this process continued through the Paleogene and Neogene, with the highest intensity of diversification in the last 10 Mya 119. According to our observations, the tribe Alsineae, which includes the genus Cerastium, started to diversify at 20.6 Mya, whereas the beginning of that process for the genus Cerastium was dated to ca. 3.66 Mya. Our results suggested that C. glomeratum split earliest from the other representatives of this genus, whereas the other species appeared to be at early stages of diversification. The high similarity of the studied Cerastium plastome sequences may be treated as possible evidence of weak barriers to breeding between these species, which enabled spontaneous hybridization between them. Interspecific hybridization events were previously reported for many Cerastium species 8,120.

Conclusion
The chloroplast genomes of Cerastium alpinum, C. arcticum, and C. nigrescens were sequenced and characterized for the first time. The reported chloroplast genomes appeared to be highly conserved in terms of gene content and order as well as their quadripartite structure. Highly divergent regions (rpl32-trnL-UAG, ndhA intron, rps16 intron, trnD-GUC-trnY-GUA, trnF-GAA-ndhJ, ndhC-trnV-UAC, petA-psbJ, psbE-petL and trnP-UGG-psaJ) and microsatellite sequences that could potentially be used as markers in genetic diversity or phylogenetic studies were identified. Reconstruction of phylogenetic relationships within the family Caryophyllaceae confirmed the previously reported systematic relations within that group of plants and supported the position of Cerastium species as a separate clade within the tribe Alsineae. Although the obtained data provide insight into the evolution and biogeographic history of the genus Cerastium, further studies are needed to fully elucidate the relationships between species from the C. alpinum-C. arcticum complex.
S3), C. arcticum (Supplementary Fig. S4) and C. nigrescens (Supplementary Fig. S5).

Plant material, DNA extraction and chloroplast genome sequencing
Total genomic DNA was extracted from the fresh or dried material of a single plant using the Maxwell 16 LEV Plant DNA Kit (Promega, Madison, WI). The amount and purity of the isolated DNA were estimated spectrophotometrically (NanoDrop ND-1000 UV/Vis; NanoDrop Technology). Additionally, the quality of the DNA was verified on 1.5% (w/v) agarose in the presence of 0.5 µg/ml ethidium bromide (wavelength 300 nm; Ultra-Lum EB-20 Electronic UV Transilluminator).
The genome libraries (library kit: TruSeq DNA PCR Free (350)), prepared from high-quality genomic DNA, were sequenced on the Illumina NovaSeq6000 platform (Illumina Inc., San Diego, CA, USA) with 150 bp paired-end reads.

Annotation and genome analysis
The quality of raw reads was checked with the FastQC tool. Raw reads were trimmed (5 bp of each read end, regions with more than 5% probability of error per base) and mapped to the reference chloroplast genome of C. glomeratum (NC_066897) using Geneious v.R7 software 124 with medium-low sensitivity settings. The subsequent procedures for chloroplast genome assembly and annotation were described in detail in our previous study 78. The chloroplast genomes were annotated using PlasMapper 125 with manual adjustment, and circular maps of the chloroplast genomes were drawn using the OrganellarGenomeDRAW tool 126. Each chloroplast genome assembly was validated using GetOrganelle v.1.7.7.0 127. Additionally, to check for the possible presence of heteroplasmy, variant calling analysis was performed in Geneious software using the "Find Variations/SNPs (Single Nucleotide Polymorphism)" feature with the following parameters: minimum variant frequency = 0.1, minimum coverage = 10, p-value cut-off = 0.0001, and default values for the remaining parameters.

Genomic repeats and SSR analysis
The genomic repeats, including forward, reverse, palindromic and complementary sequences, were identified using REPuter software 128 with the following settings: minimal repeat size of 30 bp, Hamming distance of 3, and 90% sequence identity. Chloroplast simple sequence repeats (SSRs), also called microsatellites, were identified in Phobos v.3.3.12 129. Only perfect SSRs with a motif size of one to six nucleotide units were considered. Additionally, we applied the standard thresholds for the identification of chloroplast SSRs 130: the minimum number of repeat units was set to 12, 6, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta- and hexanucleotides, respectively. A single IR region was used to eliminate the influence of doubled IR regions, and redundant results were deleted manually.
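The SSR thresholds listed above can be illustrated with a minimal Python sketch. It is not the Phobos algorithm used in the study; it is only a regular-expression approximation of "perfect SSR with at least N repeat units", and the function name `find_perfect_ssrs` and the toy sequence in the usage note are hypothetical.

```python
import re

# Minimum number of repeat units per motif length (mono- to hexanucleotide),
# taken from the thresholds stated in the text.
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}

def find_perfect_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs (0-based start).

    Hypothetical helper for illustration; it does not reproduce Phobos output.
    """
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif of `size` bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"); those are reported under the shorter motif.
            if any(size % k == 0 and motif == motif[:k] * (size // k)
                   for k in range(1, size)):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)
```

For example, `find_perfect_ssrs("GC" + "A" * 12 + "GC" + "AT" * 6 + "GG")` reports one mononucleotide and one dinucleotide SSR, whereas a run of only 11 A's falls below the mononucleotide threshold and is not reported.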

Comparative analysis of chloroplast genomes
Chloroplast genome sequences of the three Cerastium species (C. alpinum, C. arcticum, C. nigrescens) reported in this paper and the plastome sequences of C. glomeratum (NC_066897) and C. arvense (MH627219) acquired from the NCBI database were used for the genome synteny analysis, which was performed with MAUVE v.1.1.1 131. Furthermore, the sequences were aligned in MAFFT v.7.310 132 to perform sliding window analysis and evaluate nucleotide diversity (π) in the chloroplast genomes using DnaSP v.6.10.04 133. The step size was set to 50 base pairs, and the window length was set to 800 base pairs. Here, only complete chloroplast genome sequences were used; the C. arvense plastome, which has several gaps in its sequence, was excluded from this analysis. The results were visualized with the CIRCOS software package v.0.69-9 134.
The selective pressure on genes identified in the chloroplast genomes of C. alpinum, C. arcticum, C. nigrescens, C. glomeratum, and C. arvense was also analyzed. A total of 77 protein-coding genes were selected, for which synonymous (Ks) and non-synonymous (Ka) substitution rates, as well as the Ka/Ks ratio, were estimated using DnaSP v.6.10.04. The Cerastium glomeratum chloroplast genome was used as a reference. During the analyses, the lack of the psbL gene was noticed in C. glomeratum. Reannotation of the C. glomeratum plastome allowed us to identify the sequence of this gene in its traditional position, i.e., between psbJ and psbF (detailed location: 61391..61507). In the C. arvense cp genome, all genes annotated in the plastomes of C. alpinum, C. arcticum, and C. nigrescens were also present, but unknown nucleotides (n) were recorded in six of them (rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2); therefore, these sequences were excluded from calculations for this species. The results were visualized with the CIRCOS software package v.0.69-9 134.
The junction sites between the LSC, SSC, and IR regions were also identified and compared. Additionally, data on the codon usage distribution were acquired from the Geneious v.R7 statistics panel.
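As an illustration of the sliding-window calculation, the following Python sketch computes π as the average proportion of pairwise differences within each window, using the window length (800 bp) and step size (50 bp) stated above. It is a simplified stand-in for DnaSP, not its implementation: the function names are hypothetical, sequences are assumed to be upper-case, already aligned and of equal length, and gaps or ambiguous bases are skipped pair by pair, which may differ from how DnaSP treats such sites.

```python
from itertools import combinations

def pairwise_diff(a, b):
    """Proportion of differing sites between two aligned sequences (gaps/Ns skipped)."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    return diffs / compared if compared else 0.0

def sliding_pi(alignment, window=800, step=50):
    """Yield (window_midpoint, pi) along an alignment of at least two sequences."""
    length = len(alignment[0])
    pairs = list(combinations(range(len(alignment)), 2))
    for start in range(0, length - window + 1, step):
        chunk = [seq[start:start + window] for seq in alignment]
        # pi = mean pairwise proportion of differences within the window
        pi = sum(pairwise_diff(chunk[i], chunk[j]) for i, j in pairs) / len(pairs)
        yield start + window // 2, pi
```

Windows whose π exceeds a chosen cut-off (0.015 in this study) can then be flagged as candidate divergence hotspots.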
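The way Ka/Ks values are commonly read (Ka/Ks > 1 positive selection, Ka/Ks < 1 purifying selection, Ka/Ks = 1 neutral evolution) can be written as a short Python sketch. The function name, the tolerance for "equal to 1" and the labels are assumptions made for illustration; the Ka and Ks values themselves would come from a tool such as DnaSP.

```python
def classify_selection(ka, ks, tol=1e-9):
    """Return (ratio_or_None, label) for one gene, given its Ka and Ks estimates.

    Hypothetical helper: `tol` is an assumed tolerance for treating Ka/Ks as 1.
    """
    if ka == 0 and ks == 0:
        return None, "identical sequences"   # no substitutions at all
    if ks == 0:
        return None, "undefined (Ks = 0)"    # the ratio cannot be computed
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        label = "neutral evolution"
    elif ratio > 1.0:
        label = "positive selection"
    else:
        label = "purifying selection"
    return ratio, label
```

Genes with Ka = 0 and Ks > 0 (only synonymous substitutions) receive a ratio of 0 and are labelled as under purifying selection, while genes without any substitutions are kept separate because their ratio is undefined.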

Prediction of RNA editing sites
Potential RNA editing sites in the protein-coding genes from the chloroplast genomes of C. alpinum, C. arcticum, C. nigrescens, C. glomeratum, and C. arvense were predicted using the PREPACT 3.0 tool 135. Arabidopsis thaliana (NC_000932) was used as a reference for BLASTx prediction, and both forward (C to U) and reverse (U to C) editing options were selected, while the remaining settings were kept at default (0.001 e-value cutoff and 30% filter threshold). In the case of C. arvense, the rpl20, rpoB, rpoC1, rpoC2, ycf1, and ycf2 genes were excluded from the analysis, as unknown nucleotides (n) were recorded in their sequences. The results were visualized with the CIRCOS software package v.0.69-9 134.
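The effect of a C to U or U to C edit on the encoded amino acid can be checked with a small Python sketch based on the standard genetic code. This does not reproduce the PREPACT prediction procedure; `edit_codon` is a hypothetical helper that only applies edits at given codon positions and translates the codon before and after editing.

```python
from itertools import product

# Standard genetic code in UCAG order ("*" marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def edit_codon(codon, positions):
    """Swap C<->U at the given 1-based codon positions.

    Returns (edited_codon, amino_acid_before, amino_acid_after).
    """
    swap = {"C": "U", "U": "C"}
    bases = list(codon)
    for pos in positions:
        if bases[pos - 1] not in swap:
            raise ValueError(f"position {pos} of {codon} is not C or U")
        bases[pos - 1] = swap[bases[pos - 1]]
    edited = "".join(bases)
    return edited, CODON_TABLE[codon], CODON_TABLE[edited]
```

For instance, `edit_codon("UCU", [1, 2])` describes a double edit that turns a serine codon into the leucine codon CUU, and `edit_codon("UAA", [1])` turns the stop codon UAA into the glutamine codon CAA.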

Phylogenetic analysis
Chloroplast genome sequences of the three Cerastium species (C. alpinum, C. arcticum, and C. nigrescens) reported in this paper, as well as 56 plastomes of other representatives of the family Caryophyllaceae (including C. glomeratum and C. arvense) and A. thaliana (outgroup), were used for phylogenetic analysis (Table 3). Initially, the sequences of 71 protein-coding genes shared by all these species were extracted using a custom R script. Then, the concatenated sequences of the 71 genes were aligned in MAFFT v7.310 and used for phylogeny reconstruction by Bayesian Inference (BI). The MEGA v.7 software 136 was used to determine the best-fitting substitution model, and the GTR + G + I model was selected. The BI analysis was conducted using MrBayes v.3.2.6 137,138, according to the parameter settings described in our previous paper 63. The obtained phylogenetic tree was used as a starting tree for divergence time analysis performed using the RelTimeML feature in MEGA 7 with the GTR model. The divergence times between Cerastium arvense and Myosoton aquaticum (6.2-38.1 Mya), Arenaria serpyllifolia and Pseudostellaria japonica (20.3-83.4 Mya) and Dianthus chinensis and Silene latifolia (20.3-46.7 Mya) obtained from TimeTree 139 were used as calibration constraints in the calculations.
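The gene extraction step was done in the study with a custom R script that is not shown. Purely as an illustration of the idea (finding the genes shared by all taxa and concatenating them in a fixed order), a minimal Python sketch could look as follows; the function name and the input layout, a dictionary of gene sequences per species, are assumptions.

```python
def concatenate_shared_genes(genes_by_species):
    """Concatenate, per species, the genes present in every species.

    genes_by_species: {species: {gene_name: sequence}} (assumed input layout).
    Returns (sorted list of shared gene names, {species: concatenated sequence}).
    """
    # Genes annotated in every plastome of the data set.
    shared = sorted(set.intersection(*(set(g) for g in genes_by_species.values())))
    # Use the same gene order for every species so that positions stay comparable.
    concatenated = {
        species: "".join(genes[name] for name in shared)
        for species, genes in genes_by_species.items()
    }
    return shared, concatenated
```

The concatenated sequences would then be passed to an aligner such as MAFFT before tree inference.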

Ethics declarations
The authors confirm that the use of plants in the present study complies with international, national and/or institutional guidelines.

Figure 1. Gene map of the three Cerastium chloroplast genomes. Genes drawn inside the circle are transcribed clockwise, and those outside are transcribed counterclockwise (indicated by arrows). Different functional gene groups are color-coded. GC content variation is shown in the middle circle.

Figure 3. Number of repeat types and their distribution in four Cerastium species. (a) Length of the repeats; (b) types of repeats; (c) location of repeat sequences. F, P, and R represent forward, palindromic and reverse repeats, respectively.

Figure 4. The distribution and type of simple sequence repeats (SSRs) in the cp genomes of four Cerastium species. (a) Number of different SSR types; (b) distribution of SSR motifs in different repeat class types; (c) location of different SSRs in the IR, SSC and LSC regions; (d) partition of SSRs among IGS, introns and exons.

Figure 5. Circular visualization of the comprehensive plastome analyses of three Cerastium species (C. nigrescens, C. arcticum, and C. alpinum). The first outer track represents the chloroplast gene symbols. The second track, a line plot (A), shows haplotype diversity (π) values calculated for a sliding window of 800 bp. The red part of the line plot depicts regions with the highest diversity (π > 0.015). Histograms (B) show comparative Ka/Ks ratio values for Cerastium species, where blue, red, green and black colors depict the dominant Ka/Ks values in C. arcticum, C. nigrescens, equal for C. alpinum and C. nigrescens, and equal for all three species, respectively. The two scatter plots show the number of potential C > U and U > C editing sites within each plastid gene (C and D, respectively). The colors indicate higher numbers of RNA editing sites in C. arcticum (blue points) and in C. nigrescens and C. alpinum (green points) in comparison to the other compared species.
92) was reached in 96.4% of the nodes (53 out of 55). The reconstructed tree supported the taxonomic position of the studied species and revealed the following relationships: all Silene species together with Lychnis wilfordii formed one clade, which gathered only representatives of the Sileneae tribe; a second clade was formed by the representatives of the Caryophylleae tribe, i.e., five Dianthus species, three representatives of the genus Gypsophila and Psammosilene tunicoides; a third clade consisted of all representatives of the Alsineae tribe (eight Pseudostellaria species, Stellaria dichotoma var. lanceolata, Myosoton aquaticum and all studied Cerastium species, which formed one subgroup); a fourth clade consisted of eight Colobanthus species (Sagineae tribe), whereas Spergula arvensis (Sperguleae) and Paronychia argentea with Gymnocarpos przewalski (Paronychieae) formed two separate branches. The most diverged position on the tree was occupied by A. thaliana, which was used here as an outgroup (Fig. 6).

Results of the divergence time estimation suggested that the family Caryophyllaceae started to diversify ca. 74.46 million years ago (Mya). Subsequent radiation within the family then occurred: at ca. 51.47 Mya the Sperguleae tribe split from the other sister clades, and at ca. 48.32 Mya diversification of the Sagineae tribe (represented here by only one genus, Colobanthus) was observed. At ca. 46.26 Mya the evolutionary paths of the Alsineae and Arenariae tribes diverged from those of the Caryophylleae and Sileneae tribes. At ca. 42.26 Mya Alsineae and Arenariae split apart, and at ca. 41.93 Mya Caryophylleae split from Sileneae. Diversification events at the lower taxonomic level, e.g. within the tribes Caryophylleae, Sileneae and Alsineae, started at 30.87, 26.0 and 20.6 Mya, respectively. The genus Cerastium began to diversify at 3.66 Mya (Fig. 7).

Figure 6. Phylogenetic tree (cladogram) based on the sequences of 71 shared protein-coding genes from five Cerastium species and 54 other Caryophyllaceae representatives, reconstructed by Bayesian inference. Bayesian posterior probabilities (PP) are given at each node.

Figure 7. Divergence time estimation of selected Caryophyllaceae taxa. The numbers next to the nodes represent the divergence time (Mya, million years ago).

https://doi.org/10.1038/s41598-023-46017-y

The research material consisted of three Cerastium species: C. alpinum, C. arcticum, and C. nigrescens. Fresh leaves of C. alpinum and C. arcticum were harvested from plants grown from seeds in a greenhouse (Department of Plant Physiology, Genetics and Biotechnology, University of Warmia and Mazury in Olsztyn, Poland). The seeds of C. alpinum were collected in 2020 in Babia Góra National Park (Poland) after obtaining permission from the Polish Ministry of Environment. The seeds of C. arcticum were collected by Michał Węgrzyn from the Institute of Botany of Jagiellonian University in Kraków, Poland, during the Arctic expedition to the Nicolaus Copernicus University Polar Station in Spitsbergen in 2012. In turn, C. nigrescens individuals were collected by Keith W. Larson from the Climate Impacts Research Centre, Umeå University, Sweden, in the Nuolja massif (Sweden) and delivered to Olsztyn in dried form. Species identification included analysis of both vegetative and generative organs. C. alpinum and C. arcticum were identified by Irena Giełwanowska, whereas the status of C. nigrescens was verified by Keith W. Larson. Voucher specimens of each studied species have been deposited in the Vascular Plants Herbarium of the Department of Botany and Nature Protection at the University of Warmia and Mazury in Olsztyn, Poland (OLS), under the following numbers: C. alpinum (No. OLS 33837), C. arcticum (No. OLS 33840) and C. nigrescens (No. OLS 33841). Photographs of representatives of each studied species were provided as supplementary material: C. alpinum (Supplementary Fig.

Table 1. Summary of chloroplast genome characteristics of the studied Cerastium species.

Table 2. List of genes present in the chloroplast genomes of Cerastium. a Genes associated with Photosystem I.

Table 3. GenBank accession numbers and references for chloroplast genomes used in this study. Species list arranged alphabetically.